This guide shows how to crawl the contents of a site quickly and selectively
by walking through some examples.
All the examples are located in the examples directory
and can be run as follows:
run-example.bat "exampleDirName"
where "exampleDirName" is the name of the directory containing the example configuration files.
The "Google Images" example can be launched as follows:
run-example.bat googleImages
It fetches from http://images.google.com only the
images found by searching for the word "crawler".
You can edit this script to change the search keyword.
The custom configuration file used by this example is named google_images-config.xml
and can be found in $SMARTCRAWLER_HOME/examples/googleImages/conf:
<?xml version="1.0" encoding="UTF-8"?>
<smartcrawler>
  <engine>
    <threadsNumber>5</threadsNumber>
  </engine>
  <loggers>
    <logger type="TRACER" active="yes"/>
    <logger type="ACCESS" active="yes"/>
    <logger type="LINK" active="yes"/>
    <logger type="PERMISSIONS" active="yes"/>
    <logger type="EXTRACTOR" active="yes"/>
    <logger type="CONSOLE" active="yes"/>
    <logger type="PERSISTER" active="yes"/>
    <logger type="PROVIDER" active="yes"/>
  </loggers>
  <retriever>
    <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>
    <filters>
      <filter>
        <name>LinkFilter</name>
        <class>org.smartcrawler.filter.LinkFilter</class>
        <priority>5</priority>
        <filter-param>
          <param-name>links</param-name>
          <param-value>
            *images.google.it/images*
            *.gif
            *.GIF
            *.jpg
            *.JPG
          </param-value>
        </filter-param>
      </filter>
    </filters>
  </retriever>
  <persister>
    <class>org.smartcrawler.persistence.FileSystemPersister</class>
    <persister-params>
      <persister-param>
        <param-name>preservePath</param-name>
        <param-value>true</param-value>
      </persister-param>
      <persister-param>
        <param-name>rootDir</param-name>
        <param-value>google</param-value>
      </persister-param>
    </persister-params>
    <filters>
      <filter>
        <name>ImagesLinkFilter</name>
        <class>org.smartcrawler.filter.ContentTypeLinkFilter</class>
        <priority>1</priority>
        <filter-param>
          <param-name>mime-type</param-name>
          <param-value>image</param-value>
        </filter-param>
      </filter>
    </filters>
  </persister>
</smartcrawler>
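The "links" parameter in the configuration above is a list of wildcard patterns. As a rough illustration of how such patterns could select URLs, the Python sketch below uses shell-style matching from the standard library. It assumes a URL is accepted when it matches at least one pattern, which may not be exactly what org.smartcrawler.filter.LinkFilter does; the function name accepts and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the "links" parameter of google_images-config.xml.
PATTERNS = ["*images.google.it/images*", "*.gif", "*.GIF", "*.jpg", "*.JPG"]


def accepts(url, patterns=PATTERNS):
    """Return True if the URL matches at least one wildcard pattern.

    Assumption: any-match, case-sensitive semantics. The real LinkFilter
    may apply the patterns differently.
    """
    return any(fnmatchcase(url, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the patterns.
    for url in ("http://images.google.it/images?q=crawler",
                "http://example.com/pics/a.jpg",
                "http://example.com/index.html"):
        print(url, accepts(url))
```

Under these assumptions, both upper-case and lower-case extensions have to be listed explicitly, which would explain why the configuration contains both *.gif and *.GIF.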
The "New York Times RSS" example can be launched as follows:
run-example.bat nytRss
It navigates the web site http://www.nyt.com to retrieve
all of its existing RSS feeds.
The custom configuration file used by this example is named nyt_rss-config.xml
and can be found in $SMARTCRAWLER_HOME/examples/nytRss/conf:
<?xml version="1.0" encoding="UTF-8"?>
<smartcrawler>
  <engine>
    <threadsNumber>5</threadsNumber>
  </engine>
  <loggers>
    <logger type="TRACER" active="no"/>
    <logger type="CONSOLE" active="yes"/>
    <logger type="ACCESS" active="no"/>
    <logger type="LINK" active="no"/>
    <logger type="EXTRACTOR" active="no"/>
    <logger type="PROVIDER" active="no"/>
    <logger type="PERMISSIONS" active="no"/>
    <logger type="PERSISTER" active="no"/>
  </loggers>
  <retriever>
    <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>
    <filters>
      <filter>
        <name>DefaultLinkFilter</name>
        <class>org.smartcrawler.filter.DefaultLinkFilter</class>
        <priority>1</priority>
      </filter>
      <filter>
        <name>LinkFilter</name>
        <class>org.smartcrawler.filter.LinkFilter</class>
        <priority>2</priority>
        <filter-param>
          <param-name>links</param-name>
          <param-value> */rss* </param-value>
        </filter-param>
      </filter>
    </filters>
  </retriever>
  <persister>
    <class>org.smartcrawler.persistence.FileSystemPersister</class>
    <persister-params>
      <persister-param>
        <param-name>preservePath</param-name>
        <param-value>true</param-value>
      </persister-param>
      <persister-param>
        <param-name>rootDir</param-name>
        <param-value>nyt</param-value>
      </persister-param>
    </persister-params>
    <filters>
      <filter>
        <name>XMLLinkFilter</name>
        <class>org.smartcrawler.filter.ContentTypeLinkFilter</class>
        <priority>1</priority>
        <filter-param>
          <param-name>mime-type</param-name>
          <param-value>xml</param-value>
        </filter-param>
      </filter>
    </filters>
  </persister>
</smartcrawler>
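When adapting one of these configurations, it can help to see at a glance which filters are attached to the retriever and which to the persister. The following Python sketch is not part of SmartCrawler: it parses a configuration with the standard library and lists the filters of one section, sorted by their priority number. The function name list_filters and the SAMPLE string (a trimmed copy of the retriever section above) are introduced here for illustration only, and this guide does not state whether a lower priority number means a filter is applied earlier.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the retriever section of nyt_rss-config.xml (illustration only).
SAMPLE = """
<smartcrawler>
  <retriever>
    <filters>
      <filter>
        <name>LinkFilter</name>
        <class>org.smartcrawler.filter.LinkFilter</class>
        <priority>2</priority>
      </filter>
      <filter>
        <name>DefaultLinkFilter</name>
        <class>org.smartcrawler.filter.DefaultLinkFilter</class>
        <priority>1</priority>
      </filter>
    </filters>
  </retriever>
</smartcrawler>
"""


def list_filters(config_xml, section):
    """Return (priority, name, class) tuples for one section, sorted by priority.

    section is "retriever" or "persister", the two elements that carry a
    <filters> child in the configurations shown in this guide.
    """
    root = ET.fromstring(config_xml)
    found = []
    for node in root.findall(f"./{section}/filters/filter"):
        found.append((
            int(node.findtext("priority")),
            node.findtext("name"),
            node.findtext("class"),
        ))
    return sorted(found)


if __name__ == "__main__":
    for priority, name, cls in list_filters(SAMPLE, "retriever"):
        print(priority, name, cls)
```

To inspect a real file, its contents can be read as bytes and passed in place of SAMPLE.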